fix: convergence issue by adding use_inductor=False in vllm compilation_config #1014

Merged
terrykong merged 11 commits into main from zhiyul/deepscaler_recipe_convergence_fix on Sep 9, 2025

Conversation

@ZhiyuLi-Nvidia (Contributor) commented Aug 28, 2025

What does this PR do?

Closes #998.

It looks like this can be resolved with the compilation flag {"use_inductor": False}.
"With this flag, vllm will use the custom CUDA kernels instead of the Triton kernels generated by torch.compile," which might be causing the numerical issue here.

There were no logprob error spikes over 140 steps, and rewards increased stably. Speed looks similar with and without the flag.
https://wandb.ai/nvidia/grpo-dev-zhiyul/workspace?nw=nwuserzhiyul

  • Rewards over 140 steps (plot attached)
  • Logprob error with and without the change (plot attached)

Issues

Closes #998 (GRPO Convergence Issue with vllm cuda graph enabled).

Usage

  • A usage sketch is included below.
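A minimal sketch of passing the flag to vLLM directly (illustrative only; the model name is an example, and this assumes a vLLM version whose LLM constructor accepts a compilation_config dict):

```python
from vllm import LLM, SamplingParams

# Disable the Triton kernels generated by torch.compile/Inductor so that vLLM
# falls back to its custom CUDA kernels.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative model
    compilation_config={"use_inductor": False},
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```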

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@parthchadha (Contributor)

@ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well? Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

@terrykong (Contributor) left a comment

nice find @ZhiyuLi-Nvidia!

is it possible to construct a model diagnostic test for this?

https://github.com/NVIDIA-NeMo/RL/tree/main/tools/model_diagnostics

might be helpful for others who are debugging their model run

@ZhiyuLi-Nvidia (Contributor, Author)

Thank you @parthchadha

> @ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well?

Which model do you recommend?

> Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

Added the key screenshots.

…lation_config

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 6883f11 to b3aae4f Compare August 28, 2025 17:46
@parthchadha (Contributor)

> Thank you @parthchadha
>
> > @ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well?
>
> Which model do you recommend?
>
> > Also, please attach the plots to the PR description since not everyone can access internal wandb reports.
>
> Added the key screenshots.

Let's run qwen 32b from #957 (we can try with 32k osl)

github-actions bot added the documentation label Aug 29, 2025
@ZhiyuLi-Nvidia (Contributor, Author)

> is it possible to construct a model diagnostic test for this?
>
> https://github.com/NVIDIA-NeMo/RL/tree/main/tools/model_diagnostics
>
> might be helpful for others who are debugging their model run

Good suggestion. Added.
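For reference, a rough sketch of what such a diagnostic could look like (illustrative only, not the actual script added under tools/model_diagnostics; the model name and structure are assumptions): load the same model with and without Inductor, generate greedily, and compare tokens and logprobs.

```python
from vllm import LLM, SamplingParams

# Illustrative comparison of greedy generations with and without Inductor.
# In practice, run each configuration in a separate process to avoid GPU
# memory contention between the two engines.
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # example model
PROMPTS = ["The capital of France is", "2 + 2 ="]
PARAMS = SamplingParams(temperature=0.0, max_tokens=32, logprobs=1)

def run(use_inductor: bool):
    llm = LLM(model=MODEL, compilation_config={"use_inductor": use_inductor})
    return llm.generate(PROMPTS, PARAMS)

baseline = run(use_inductor=False)
candidate = run(use_inductor=True)

for b, c in zip(baseline, candidate):
    same_tokens = list(b.outputs[0].token_ids) == list(c.outputs[0].token_ids)
    gap = abs(b.outputs[0].cumulative_logprob - c.outputs[0].cumulative_logprob)
    print(f"{b.prompt!r}: same_tokens={same_tokens}, logprob_gap={gap:.4f}")
```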

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 55191aa to f5bf231 Compare August 29, 2025 18:46
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia (Contributor, Author)

@terrykong added output example 2ba5e3e

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia (Contributor, Author)

> Let's run qwen 32b from #957 (we can try with 32k osl)

@parthchadha I kept getting OOMs in the middle of training. Shall we come back to it once this is merged or once things are in a more stable state?

parthchadha previously approved these changes Sep 2, 2025
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 2, 2025
terrykong previously approved these changes Sep 2, 2025
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 3, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 1bcb7ae to dddcbf0 Compare September 3, 2025 17:41
@terrykong terrykong changed the title from "fix: fix convergence issue by adding use_inductor=False in vllm compi…" to "fix: fix convergence issue by adding use_inductor=False in vllm compilation_config" Sep 3, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia changed the title from "fix: fix convergence issue by adding use_inductor=False in vllm compilation_config" to "fix: convergence issue by adding use_inductor=False in vllm compilation_config" Sep 3, 2025
@terrykong terrykong enabled auto-merge September 3, 2025 19:17
terrykong previously approved these changes Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 3, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 3, 2025
parthchadha previously approved these changes Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 4, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
terrykong previously approved these changes Sep 4, 2025
@terrykong terrykong added this pull request to the merge queue Sep 4, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 5, 2025
@terrykong terrykong added this pull request to the merge queue Sep 8, 2025
Merged via the queue into main with commit 1c85276 Sep 9, 2025
21 checks passed
@terrykong terrykong deleted the zhiyul/deepscaler_recipe_convergence_fix branch September 9, 2025 00:51
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
…on_config (NVIDIA-NeMo#1014)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
HeyyyyyyG pushed a commit that referenced this pull request Oct 3, 2025
…on_config (#1014)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…on_config (NVIDIA-NeMo#1014)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Labels

documentation: Improvements or additions to documentation

Development

Successfully merging this pull request may close these issues.

GRPO Convergence Issue with vllm cuda graph enabled

3 participants